Consider all pairs of (non-overlapping) subsequences of length \(w\) of a given time series. The first \(k\) by increasing distance are the top-\(k\) motifs.
Still, it’s an \(\Omega(n^2)\) algorithm.
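To make the quadratic cost concrete, here is a brute-force sketch of the definition above (the names `znorm` and `topk_motifs_bruteforce` are ours, not from any library): it z-normalizes every subsequence and compares all non-overlapping pairs, hence the \(\Omega(n^2)\) behavior.

```python
import numpy as np

def znorm(x):
    # z-normalize: subtract the mean, divide by the standard deviation
    return (x - x.mean()) / x.std()

def topk_motifs_bruteforce(ts, w, k):
    """Naive top-k motif discovery: z-normalized Euclidean distance
    over all non-overlapping pairs of length-w subsequences."""
    n = len(ts) - w + 1
    subs = [znorm(ts[i:i + w]) for i in range(n)]
    pairs = []
    for i in range(n):
        for j in range(i + w, n):  # skip overlapping pairs
            d = np.linalg.norm(subs[i] - subs[j])
            pairs.append((d, i, j))
    pairs.sort()
    return pairs[:k]  # the k closest pairs, by increasing distance
```

The two nested loops over all subsequence pairs are exactly what makes this infeasible at the billion-point scale.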
Finds motifs in a time series of 1 billion points in one day, using 5 GPUs.
To address these challenges, we will use an approach based on Locality Sensitive Hashing (LSH)
We consider the z-normalized Euclidean distance \[
d(x, y) = \sqrt{\sum_{i\in[1,w]} \left(
\frac{x_i - \bar{x}}{\sigma_x}
-
\frac{y_i - \bar{y}}{\sigma_y}
\right)^2
}
\] In this example we have a window length \(w = 640\),
hence we have vectors in \(\mathbb{R}^{640}\)
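As a quick illustration, the distance above transcribes directly into code (the helper name `znorm_dist` is ours):

```python
import numpy as np

def znorm_dist(x, y):
    # z-normalized Euclidean distance: z-normalize each subsequence,
    # then take the ordinary Euclidean distance between them
    xz = (x - x.mean()) / x.std()
    yz = (y - y.mean()) / y.std()
    return np.sqrt(np.sum((xz - yz) ** 2))
```

Note that z-normalization makes the distance invariant to offset and scaling: two subsequences with the same shape but different amplitude are at distance zero.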
For \(r \in \mathbb{R}^+\), \(\vec{a} \sim \mathcal{N}(0,1)^w\), and \(b \sim \mathcal{U}(0,r)\), the hash function is \[ h(\vec{x}) = \left\lfloor\frac{\vec{a} \cdot \vec{x} + b}{r}\right\rfloor \]
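A minimal sketch of this hash function (the factory name `make_hash` is ours):

```python
import numpy as np

def make_hash(w, r, rng):
    # Draw a ~ N(0,1)^w and b ~ U(0, r); return h(x) = floor((a.x + b) / r)
    a = rng.normal(size=w)
    b = rng.uniform(0.0, r)
    return lambda x: int(np.floor((a @ x + b) / r))
```

The locality-sensitive property is that nearby vectors receive the same hash value far more often than distant ones.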
The key point is that we only compute the distance of subsequences falling
into the same bucket.
To lower the collision probability we concatenate \(k\) hash functions \[ \hat{h}(\vec{x}) = \langle h_1(\vec{x}), \dots, h_k(\vec{x}) \rangle \] This improves the precision of the index.
And to increase the recall of the index we repeat \(L\) times: with enough repetitions, each close pair of subsequences falls into the same bucket at least once, with high probability.
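The combination of the two ideas can be sketched as follows (an illustrative implementation, not the one used by the algorithm; `lsh_tables` is our name): each of the \(L\) tables keys every vector by the concatenation of \(k\) discretized random projections.

```python
import numpy as np

def lsh_tables(vectors, k, L, r, rng):
    """Build L hash tables; in each, every vector is keyed by the
    concatenation of k discretized random projections."""
    w = len(vectors[0])
    tables = []
    for _ in range(L):
        A = rng.normal(size=(k, w))      # k random directions
        b = rng.uniform(0.0, r, size=k)  # k random offsets
        table = {}
        for idx, x in enumerate(vectors):
            key = tuple(np.floor((A @ x + b) / r).astype(int))
            table.setdefault(key, []).append(idx)
        tables.append(table)
    return tables
```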
We maintain a priority queue of the top pairs seen across repetitions.
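One standard way to maintain such a queue (a sketch, not necessarily the data structure Attimo uses) is a bounded max-heap, so the farthest of the current top pairs can be evicted in \(O(\log k)\):

```python
import heapq

def update_topk(heap, k, dist, pair):
    # Keep the k closest pairs seen so far. Python's heapq is a
    # min-heap, so we store negated distances: the root is then the
    # farthest retained pair, evicted when a closer one arrives.
    if len(heap) < k:
        heapq.heappush(heap, (-dist, pair))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, pair))
```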
By performing enough repetitions, we can ensure that it is unlikely that we missed pairs that are closer than the current top-\(k\) pairs.
Formally, we can ensure that we find the exact top-\(k\) motifs with probability \(1-\delta\), for a user defined \(\delta\).
The smaller the \(\delta\), the higher the number of repetitions.
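Under the standard LSH analysis (a simplification; the algorithm's actual stopping rule is adaptive), a pair whose points collide under a single hash with probability \(p\) collides under a concatenated key with probability \(p^k\), so it is missed by all \(L\) repetitions with probability \((1-p^k)^L\). Requiring this to be at most \(\delta\) gives a rough repetition count (`repetitions_needed` is our name):

```python
import math

def repetitions_needed(p, k, delta):
    # A pair colliding with probability p per hash collides with
    # probability p**k per concatenated key, so it is missed by all
    # L repetitions with probability (1 - p**k)**L <= delta.
    return math.ceil(math.log(delta) / math.log(1 - p ** k))
```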
In each iteration we compute the distance of all subsequences in the same bucket.
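Enumerating the candidate pairs of one table is then straightforward (an illustrative sketch; `candidate_pairs` is our name):

```python
from itertools import combinations

def candidate_pairs(table):
    # Distances are computed only for pairs that share a bucket
    pairs = set()
    for bucket in table.values():
        pairs.update(combinations(bucket, 2))
    return pairs
```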
In the first iteration, with \(k=4\), we discover a pair at distance 2.
After 10 repetitions, we find a pair at distance 1.
After 100 repetitions we did not find a better pair, and the success probability is about 0.85.
We then consider shorter prefixes of the hashes,
going through the 100 repetitions again.
After 15 repetitions, we observe that the
success probability is above our target, and thus return.
Theorem 1 The LSH index construction takes time \[ O(K \cdot \sqrt{L} n\log n) \]
Theorem 2 Let \(d(m_k)\) be the distance of the \(k\)-th motif, and \(i'\le K\), \(j' \le L\) be the parameters used by the optimal LSH algorithm. Then, the algorithm considers \[ O\left( j'\sum_{m\in T^w\times T^w} p(d(m))^{i'} + (L-j')\sum_{m\in T^w\times T^w} p(d(m))^{i'+1} \right) \] pairs in expectation.
Find the top-10 motifs.
| dataset | \(n\) (millions) | RC (relative contrast) |
|---|---|---|
| astro | 1 | 8.63 |
| GAP | 2 | 9.17 |
| freezer | 7 | 7.95 |
| ECG | 7 | 109.06 |
| HumanY | 26 | 581.03 |
| Whales | 308 | 21.66 |
| Seismic | 1000 | 274.44 |
Relative Contrast measures difficulty: higher is easier. \[ RC = \frac{d_{avg}}{d_{m_k}} \]
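The average distance \(d_{avg}\) can be estimated on a sample of random subsequence pairs; the sketch below (our simplification, assuming the motif distance \(d_{m_k}\) is already known) follows the formula directly.

```python
import numpy as np

def relative_contrast(ts, w, motif_dist, samples=1000, seed=0):
    # Estimate RC = d_avg / d(m_k): average the z-normalized distance
    # over randomly sampled subsequence pairs, divide by the motif distance.
    rng = np.random.default_rng(seed)
    n = len(ts) - w + 1
    dists = []
    for _ in range(samples):
        i, j = rng.integers(0, n, size=2)
        x, y = ts[i:i + w], ts[j:j + w]
        xz = (x - x.mean()) / x.std()
        yz = (y - y.mean()) / y.std()
        dists.append(np.linalg.norm(xz - yz))
    return float(np.mean(dists)) / motif_dist
```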
Data from LIGO:
Attimo: 6 hours; SCAMP: \(\approx 1\) second.

Synthetic data, planted motifs with different relative contrasts. SCAMP-gpu only has one line since it is data-independent.
Attimo only provides information about the top motif(s), whereas the Matrix Profile provides additional information.
For shorter time series, or if the relative contrast is very small, use the Matrix Profile.
For time series of a few million values and above, with a not-too-small relative contrast, try Attimo.
Running for top-10 motifs, for different numbers of repetitions.
freezer and the 7-th motif: 7M points, window 5000.